Conversation

@AnilSorathiya
Contributor

@AnilSorathiya AnilSorathiya commented Sep 15, 2025

Pull Request Description

What and why?

  • Integrate Deepeval scorers into ValidMind as first-class scorers under a dedicated deepeval namespace, enabling evaluation of LLM outputs with standardized metrics.
  • Add Deepeval-based LLM scorers (e.g., Hallucination, Contextual Precision/Recall, Summarization, Task Completion) and a supporting demo notebook for end-to-end usage (a rough sketch of the wrapping pattern follows this list).
  • Maintenance: update .gitignore for *.deepeval artifacts; remove deprecated/duplicate tests (e.g., GEval); improve plots (e.g., boxplot) and examples.
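
For context, here is a rough sketch of the pattern these scorer wrappers follow, based on the DeepEval calls shown later in this thread; it is illustrative only, not the actual ValidMind implementation, and the function name is made up:

    from deepeval.metrics import SummarizationMetric
    from deepeval.test_case import LLMTestCase

    def summarization_score(source_text: str, summary: str, threshold: float = 0.5) -> float:
        # Build a DeepEval test case from one dataset row, run the metric,
        # and return the numeric score that ValidMind stores as a column.
        test_case = LLMTestCase(input=source_text, actual_output=summary)
        metric = SummarizationMetric(threshold=threshold)
        metric.measure(test_case)
        return metric.score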

How to test

  • Notebook validation:
    • Run notebooks/code_sharing/deepeval_integration_demo.ipynb end-to-end; verify Deepeval scorers run, log results, and produce expected figures/tables.
    • Run notebooks/code_samples/agents/langgraph_agent_simple_banking_demo.ipynb with validmind/datasets/llm/agent_dataset.py to exercise Task Completion and related scorers.
  • Scorer/runtime tests:
    • Run pytest for scorer interfaces and decorator behavior:
      • pytest -q tests/test_scorer_decorator.py
      • pytest -q tests/test_unit_tests.py and any added tests tagged for Deepeval/scorers.

What needs special review?

Dependencies, breaking changes, and deployment notes

Release notes

Checklist

  • What and why
  • Screenshots or videos (Frontend)
  • How to test
  • What needs special review
  • Dependencies, breaking changes, and deployment notes
  • Labels applied
  • PR linked to Shortcut
  • Unit tests added (Backend)
  • Tested locally
  • Documentation updated (if required)
  • Environment variable additions/changes documented (if required)

@AnilSorathiya AnilSorathiya added the enhancement (New feature or request) label Oct 7, 2025
@AnilSorathiya AnilSorathiya marked this pull request as ready for review October 7, 2025 16:04
@cachafla
Contributor

Some feedback for notebooks/code_samples/agents/langgraph_agent_simple_banking_demo.ipynb:

Instead of "We'll use our comprehensive banking test dataset to evaluate our agent's performance across different banking scenarios." I'd suggest:

We'll use a sample test dataset to evaluate our agent's performance across different banking scenarios.

For validmind.scorer.llm.deepeval.TaskCompletion, how does a user control the default verbosity of DeepEval tests? They print a lot of things.


For "Let's add box plot for task completion score." there should be an explanation that the previous test has added a new column `TaskCompletion_score` as part of `assign_scores` and that this is what we're going to use for the box plot. We should also explain that these columns are added because of how our scorer return values are processed.

@cachafla
Contributor

For notebooks/code_sharing/deepeval_integration_demo.ipynb:

Does the %pip install -q validmind require [all] like the other notebook?

For "Compute metrics using ValidMind scorer interface":

There should be a short explanation clarifying how the test knows where the input and output columns are declared. That way the end user will know how the input dataset is being used.

Alternatively we can also pass input_column and actual_output_column explicitly so we self document how the scorers work, even though we match the default argument values.

I tend to think that this could apply for all uses of scorers in demo notebooks actually 🤔.

Towards the end of the notebook, on the section Integrate with ValidMind:

This code is not needed because vm is already initialized at the beginning:

    # Initialize ValidMind
    vm.init()
    print("ValidMind initialized")

At the end of the notebook there's a cell with this text: FIXED VERSION. What is that?


The last cell runs each of the custom_metrics with:

result = metric.measure(test_case)

How does that integrate with VM? Via scorers or tests? As a user I wouldn't know how to bring those results from GEval tests to a ValidMind document.

@cachafla cachafla left a comment

Looking good 🙌 apologies for the delay reviewing this.

My suggestion would be to put the golden datasets + GEval in separate notebooks. This notebook has everything we need, but it can feel heavy and like it's trying to do multiple things at the same time.

Specifically, I'd recommend:

  • Leaving this notebook as a demonstration of integration with DeepEval LLMTestCase and SummarizationMetric
  • Another notebook that demonstrates how to use LLMAgentDataset with the Golden dataset from DeepEval
  • Another notebook that demonstrates how to use GEval with VM scorers and/or VM tests

Specifically for Golden, I feel like we should define the actual use case we want to demonstrate here. GEval has a clearer objective, but the Golden examples with the mock LLM usage feel a bit out of place.

To expedite merging this PR, we can probably update the notebook to not include the golden datasets + GEval and come back to that in a follow-up PR.

Thoughts?

@juanmleng
Contributor

Looking good 🙌 apologies for the delay reviewing this.

My suggestion would be to put the golden datasets + GEval in separate notebooks. This notebook has everything we need, but it can feel heavy and like it's trying to do multiple things at the same time.

Specifically, I'd recommend:

  • Leaving this notebook as a demonstration of integration with DeepEval LLMTestCase and SummarizationMetric
  • Another notebook that demonstrates how to use LLMAgentDataset with the Golden dataset from DeepEval
  • Another notebook that demonstrates how to use GEval with VM scorers and/or VM tests

Specifically for Golden, I feel like we should define the actual use case we want to demonstrate here. GEval has a clearer objective, but the Golden examples with the mock LLM usage feel a bit out of place.

To expedite merging this PR, we can probably update the notebook to not include the golden datasets + GEval and come back to that in a follow-up PR.

Thoughts?

These comments look very sensible to me. Totally agree!

@AnilSorathiya
Contributor Author

For validmind.scorer.llm.deepeval.TaskCompletion, how does a user control the default verbosity of DeepEval tests? They print a lot of things.

This is a known issue in DeepEval.
Generally you can pass verbose_mode=False, but quite a few tests ignore it. I am passing:

    metric = TaskCompletionMetric(
        threshold=threshold,
        model=model,
        include_reason=True,
        strict_mode=strict_mode,
        verbose_mode=False,
    )

These tests ignore verbose_mode:
Answer Relevancy
Faithfulness
Contextual Precision
Contextual Recall
Contextual Relevancy
...
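
A possible workaround for those metrics (not part of this PR, just a generic Python approach) is to capture stdout/stderr around the measure call:

    import contextlib
    import io

    def measure_quietly(metric, test_case):
        # Swallow anything the metric prints while it runs; output written
        # directly to the terminal (e.g., some progress bars) may still get through.
        with contextlib.redirect_stdout(io.StringIO()), contextlib.redirect_stderr(io.StringIO()):
            metric.measure(test_case)
        return metric.score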

@AnilSorathiya
Contributor Author

For "Let's add box plot for task completion score." there should be an explanation that the previous test has added a new column `TaskCompletion_score` as part of `assign_scores` and that this is what we're going to use for the box plot. We should also explain that these columns are added because of how our scorer return values are processed.

Added a separate section in the notebook:

## Scorers in ValidMind

Scorers are evaluation metrics that analyze model outputs and store their results in the dataset. When using `assign_scores()` (a short example follows this list):

- Each scorer adds a new column to the dataset with the format `{scorer_name}_{metric_name}`
- The column contains the numeric score (typically 0-1) for each example
- Multiple scorers can be run on the same dataset, each adding their own column
- Scores are persisted in the dataset for later analysis and visualization
- Common scorer patterns include:
  - Model performance metrics (accuracy, F1, etc)
  - Output quality metrics (relevance, faithfulness)
  - Task-specific metrics (completion, correctness)
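
For example, a minimal sketch of the box-plot step (assuming the ValidMind dataset exposes its underlying DataFrame as `.df`; `vm_test_ds` is a placeholder name for the dataset used in the notebook):

    import matplotlib.pyplot as plt

    # assign_scores() with the TaskCompletion scorer has already added a
    # TaskCompletion_score column, per the naming scheme above.
    scores = vm_test_ds.df["TaskCompletion_score"].dropna()

    plt.boxplot(scores)
    plt.title("Task completion score distribution")
    plt.ylabel("score")
    plt.show()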

@AnilSorathiya
Contributor Author

There should be a short explanation clarifying how the test knows where the input and output columns are declared. That way the end user will know how the input dataset is being used.

Alternatively we can also pass input_column and actual_output_column explicitly so we self document how the scorers work, even though we match the default argument values.

I tend to think that this could apply for all uses of scorers in demo notebooks actually 🤔.
Makes sense, thanks.
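
For reference, the explicit form might look roughly like this; the parameter names `input_column` and `actual_output_column` come from the review comment, and the exact `assign_scores()` signature may differ:

    # Hypothetical, self-documenting call: the column arguments match the
    # defaults but make explicit which dataset columns the scorer reads.
    vm_test_ds.assign_scores(
        "validmind.scorer.llm.deepeval.TaskCompletion",
        input_column="input",
        actual_output_column="actual_output",
    )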

@AnilSorathiya
Contributor Author

The last cell runs each of the custom_metrics with:

result = metric.measure(test_case)

How does that integrate with VM? Via scorers or tests? As a user I wouldn't know how to bring those results from GEval tests to a ValidMind document.

This section has been removed from the notebook. I will create a separate notebook in PR #434.

@AnilSorathiya
Contributor Author

Looking good 🙌 apologies for the delay reviewing this.

My suggestion would be to put the golden datasets + GEval in separate notebooks. This notebook has everything we need, but it can feel heavy and like it's trying to do multiple things at the same time.

Specifically, I'd recommend:

  • Leaving this notebook as a demonstration of integration with DeepEval LLMTestCase and SummarizationMetric
  • Another notebook that demonstrates how to use LLMAgentDataset with the Golden dataset from DeepEval
  • Another notebook that demonstrates how to use GEval with VM scorers and/or VM tests

Specifically for Golden, I feel like we should define the actual use case we want to demonstrate here. GEval has a clearer objective, but the Golden examples with the mock LLM usage feel a bit out of place.

To expedite merging this PR, we can probably update the notebook to not include the golden datasets + GEval and come back to that in a follow-up PR.

Thoughts?

Yes, agree.

  • Working on GEval in a separate PR, where I will create a separate notebook: [SC 12707] Add G-eval test in lib #434
  • For the Golden dataset we will have a separate notebook as well, once we have a clear use case in mind.

@juanmleng juanmleng left a comment

LGTM! Great notebook! Just left a small comment.

@github-actions
Contributor

PR Summary

This PR introduces significant enhancements and bugfixes to improve the integration between ValidMind and DeepEval. Key functional changes include:

  • Modifications to the .gitignore to cover additional file types such as *.qmd and DeepEval-related files.
  • Updates to test datasets and notebooks: Several notebooks have been updated to demonstrate new use cases covering various LLM test scenarios. This includes sample test cases for banking risk evaluation, retrieval-augmented generation (RAG) systems, and agent evaluations. Test cases now leverage DeepEval metrics such as TaskCompletion, Faithfulness, Summarization, Bias, Contextual Relevancy, Contextual Precision, and Contextual Recall.
  • Enhanced handling of tool calls in agent evaluations: The logic to extract tool calls, responses, and to integrate these into the TaskCompletion metric was refactored. This ensures that both dictionary and object formats are supported and that the tool response data is correctly processed.
  • Updates to dataset conversion: In the LLMAgentDataset conversion functions, the serialization of tool call fields has been simplified by removing redundant serialization steps.
  • Numerous improvements to code comments and documentation within the notebooks and scorer modules, making the evaluation flows clearer for users.
  • A minor version bump (2.10.1) in configuration files, which is not part of the functional changes summarized above.

Overall, these changes aim to streamline the testing infrastructure for LLMs, improve metric evaluations, and provide detailed insights into agent behavior through advanced scoring metrics from DeepEval.
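
For the tool-call extraction point above, an illustrative sketch of normalizing both dict- and object-shaped tool calls into DeepEval's ToolCall (assuming ToolCall accepts name and input_parameters; the PR's actual helper may differ):

    from deepeval.test_case import ToolCall

    def to_tool_call(raw) -> ToolCall:
        # Accept either a dict-style tool call ({"name": ..., "args": {...}})
        # or an object with .name / .args attributes, and normalize it.
        if isinstance(raw, dict):
            name = raw.get("name", "")
            args = raw.get("args") or raw.get("arguments") or {}
        else:
            name = getattr(raw, "name", "")
            args = getattr(raw, "args", {}) or {}
        return ToolCall(name=name, input_parameters=args)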

Test Suggestions

  • Run the updated notebooks to ensure that all new DeepEval-based test cases compute scores correctly for various metrics (e.g., TaskCompletion, Summarization).
  • Validate the extraction logic for tool calls by providing both dictionary and object formatted messages and confirming correct ToolCall instantiation.
  • Execute unit tests for each new scorer module (Bias, ContextualPrecision, ContextualRecall, ContextualRelevancy, Faithfulness, Hallucination, and TaskCompletion) to verify they handle input datasets as expected.
  • Manually inspect the output of the modified LLMAgentDataset to verify that the 'tools_called' field is correctly populated without unnecessary serialization steps.

@AnilSorathiya AnilSorathiya merged commit 34ba898 into main Oct 17, 2025
17 checks passed
@AnilSorathiya AnilSorathiya deleted the anilsorathiya/sc-12254/add-new-deepeval-tests-in-lib branch October 17, 2025 11:22